
patch: advanced rolling ops using etcd#364

Open
patriciareinoso wants to merge 27 commits into canonical:main from patriciareinoso:DPE-9349-rolling-ops

Conversation

@patriciareinoso patriciareinoso commented Mar 16, 2026

Context

All the code in this PR is new.

This implementation is based on DA241 Spec

  • The content of charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py belongs to another library that is currently being migrated to charmlibs, so you can ignore it for now.

Summary

This PR is the first part of the implementation of advanced rolling ops at cluster level.

This PR includes:

  • Management of the client certificate used to connect to etcd
    • The leader unit creates a self-signed certificate with a lifetime of 50 years
    • The certificate is shared with the other units using a peer relation
  • Implementation of the integration with etcd
    • Sharing the mTLS certificate
    • Observing the resource_created event
    • Observing the endpoints_changed event
  • Management of an env file needed to connect to etcd via etcdctl
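The CA bullet above (a self-signed certificate with a 50-year lifetime) could be sketched with the `cryptography` package. The charm code itself uses charmlibs TLS helpers (`PrivateKey`, `CertificateRequestAttributes`, seen later in the review), which wrap similar calls; everything here, including the key size and common name, is illustrative only.

```python
import datetime

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID


def generate_client_ca(common_name: str = 'rollingops-client-ca'):
    """Create a self-signed CA key and certificate with a ~50-year lifetime."""
    key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, common_name)])
    now = datetime.datetime.now(datetime.timezone.utc)
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)  # self-signed: issuer == subject
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=365 * 50))
        .add_extension(x509.BasicConstraints(ca=True, path_length=None), critical=True)
        .sign(key, hashes.SHA256())
    )
    return key, cert
```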

This PR does not implement the locking mechanism. Here we only test that we can connect to etcd from each unit.

Current workflow:

  1. The unit makes a request
  2. A new background process is spawned
  3. The background process dispatches a Juju hook
  4. The unit observes that hook
  5. The unit writes and reads a value in etcd
  6. If the unit was able to connect to etcd, it executes the "restart" function.

This is a very simplified workflow to be able to test that the units from different apps can reach etcd.
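Steps 2 and 3 of the workflow above typically rely on `juju-exec` (or `juju-run` on older Juju) re-entering the charm with `JUJU_DISPATCH_PATH` set. A minimal sketch, with the event name and unit purely illustrative and the exact invocation depending on the Juju version:

```python
import subprocess


def dispatch_command(unit_name: str, event_name: str, charm_dir: str = '.') -> list[str]:
    """Build a juju-exec invocation that re-enters the charm with a custom event."""
    return [
        'juju-exec', '-u', unit_name, '--',
        f'JUJU_DISPATCH_PATH=hooks/{event_name} {charm_dir}/dispatch',
    ]


def spawn_background_dispatch(unit_name: str, event_name: str) -> None:
    """Fire-and-forget: the spawned process outlives the current hook."""
    subprocess.Popen(dispatch_command(unit_name, event_name), start_new_session=True)
```

Because Juju serializes hooks per unit, the dispatched event runs only after the current one finishes, which is what makes the fire-and-forget pattern safe.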

To do

  • Implement the actual locking mechanism
  • Figure out how to properly install etcdctl

Collaborator

james-garner-canonical commented Mar 16, 2026

Hi @patriciareinoso thanks for opening this PR. I haven't looked through the code yet, but I wanted to start with a request for some additional context first.

Please update the PR to describe whether this is a migration of an existing codebase or should be treated as all new. If it's a migration, please link to the existing library, and describe what's changing/open to change. If there was a spec involved in the design of this new lib, please also link to it.

You've probably already seen it, but I just want to also drop a link to our migration guide https://documentation.ubuntu.com/charmlibs/how-to/migrate/

If you'd like to talk about any of this synchronously, feel free to set up a meeting, my availability is in my calendar. Otherwise we'll proceed with asynchronous review when we have the context.

PS: If the PR isn't ready for review yet, you can mark it as a draft and manually re-request review when ready.

Contributor

@Gu1nness Gu1nness left a comment

Mostly questions around security and try/catch errors.
Excellent work!


ca_key = PrivateKey.generate(key_size=4096)
ca_attributes = CertificateRequestAttributes(
common_name='rollingops-client-ca', is_ca=True
Contributor

Shouldn't that be common_name? Or something decided by the integrating charm?
Or is it because it can't be chosen in a distributed way?


In addition to Neha's comment on the common_name, it might be worth adding add_unique_id_to_subject_name=False to the attributes.

Contributor

Pinning to this file, but it's a more generic question: is it possible for a unit to request a lock for another unit by "hacking" the etcdctl calls that request a lock? What would be the consequences?
Is there a way to protect against that?

Author

Yes, a unit can hack the etcdctl calls to request a lock for another unit. In the app-level solution we have isolation in the databag (a unit cannot write in another unit's databag), but here we do not have this protection.

The only protection we have is that a unit cannot request a lock for a unit in another cluster ID; this is done using prefixes and is handled automatically by the etcd charm.

It would be nice to have an extra prefix assigned to a given unit, so that this cannot happen. But I don't think there is a straightforward (already implemented) way to do it.

If we are able to get the certificate and connect to etcd, any unit can make another unit execute an operation known by the attacked unit, which would not be very nice.
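The prefix-based isolation described above might look like the following sketch; the key layout is hypothetical, not the charm's actual scheme:

```python
def lock_key(cluster_id: str, unit_name: str) -> str:
    """Namespace lock keys per cluster so a unit cannot address another cluster's locks.

    Within a single cluster prefix there is no per-unit isolation, which is
    the gap discussed above: any unit holding the client certificate can
    write any key under its own cluster's prefix.
    """
    return f'/rollingops/{cluster_id}/{unit_name.replace("/", "-")}'
```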

Contributor

Then, two things:

  • I'm afraid this will get abused by developers to trigger restarts/actions in other units/apps. And as of now, I don't see an easy way of getting around this either (I have some ideas using asymmetric cryptography but that would be a significant effort to bring).
  • We must ensure that the ids are secure enough and cannot be guessed because that could be a real issue: one compromised application/unit in a model could lead to a full DoS of the model (and we really don't want that).

Contributor

Another question:
What would happen if I were to connect to etcd with my own client app, requesting the prefix /rollingops?
Would I be able to list and export all locks requested by every application?
It's probably something that etcd has to take care of, but I'm curious 👀

os.chmod(cls.ENV_FILE, 0o600)

@classmethod
def load_env(cls) -> dict[str, str]:
Contributor

That function scares me.
I trust that it probably works, but it does scare me that we're trying to parse exports that way.
Why do we need to go through a dedicated file to do that?
It seems to me that we're writing to the file just to read it back as an environment provider for the command runner.

Isn't there another way to do that?


The parameters that are passed as env=cls.load_env() to the subprocess command could instead be appended to the command itself (example).

This would eliminate the need to write and load an env file, but would require access to relation data at runtime.
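Passing the TLS material on the command line instead of via an env file could look like this sketch (these are standard `etcdctl` v3 flags; the paths and helper name are illustrative):

```python
def etcdctl_put(
    endpoints: list[str],
    cert: str,
    key: str,
    cacert: str,
    etcd_key: str,
    value: str,
) -> list[str]:
    """Build an etcdctl invocation carrying TLS options as flags."""
    return [
        'etcdctl',
        f'--endpoints={",".join(endpoints)}',
        f'--cert={cert}',      # client certificate
        f'--key={key}',        # client private key
        f'--cacert={cacert}',  # CA used to verify the server
        'put', etcd_key, value,
    ]
```

The resulting list can be handed to `subprocess.run` without any `env=` argument, at the cost of needing the relation data available when the command is built.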

Author

@patriciareinoso patriciareinoso Mar 25, 2026

We will be running etcd commands from the background process, meaning that we won't have access to the relation data. That's why we need to write this information to a file.
I changed it so that we now write just the configuration in JSON format.
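A minimal sketch of that JSON approach (field names hypothetical): persist the config with 0o600 permissions so the background process, which has no relation data, can reload it later.

```python
import json
import os
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EtcdConfig:
    endpoints: str
    ca_path: str
    cert_path: str
    key_path: str


def save_config(path: Path, config: EtcdConfig) -> None:
    """Write the config as JSON, readable only by the owner."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(config), indent=2))
    os.chmod(path, 0o600)


def load_config(path: Path) -> EtcdConfig:
    """Reload the config from a process with no access to relation data."""
    return EtcdConfig(**json.loads(path.read_text()))
```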

"""Stop the etcd worker process in the current unit."""
unit = event.departing_unit
if unit == self.model.unit:
self.worker.stop()
Contributor

So to be sure:

  • worker spawns one process and dies
  • process triggers one dispatch and dies
  • dispatch from the process prevents this event (relation-departed) from running (because only one event runs at a time).

Is this to cover race conditions? Or are the workers incomplete for now?

Author

The worker is incomplete for now. It may run for a "long time" waiting for the lock to be granted, so when the etcd relation is removed we should stop it.

@Gu1nness
Contributor

Also, I think you need to add a changelog file (cf. https://documentation.ubuntu.com/charmlibs/explanation/charmlibs-publishing/).

james-garner-canonical added a commit that referenced this pull request Mar 17, 2026
#364 has revealed that CI breaks with packages with a `requires-python`
lower bound that's higher than 3.10 (it has `requires-python>=3.12`).

This is because we would run `just integration-k8s` and `just
integration-machine` in the `init` step to check if they have any tests,
and that uses our default `justfile` Python version of 3.10.

This PR fixes this by recording the minimum supported Python version
when we calculate the Python versions to test the library with, and
using that version in the failing step.
@james-garner-canonical
Collaborator

CI failure should be fixed in #366

@james-garner-canonical
Collaborator

Integration tests should be fixed for real now with #373 merged, please merge main to pull in the fix.

@reneradoi reneradoi left a comment

In general the implementation looks good for the scope of the PR. I've left a few comments, mostly about error handling, and a few questions.



Comment on lines +93 to +99
try:
    with_pebble_retry(lambda: BASE_DIR.mkdir(parents=True, exist_ok=True))
    with_pebble_retry(
        lambda: CONFIG_FILE_PATH.write_text(json.dumps(asdict(config), indent=2), mode=0o600)
    )
except (FileNotFoundError, LookupError, NotADirectoryError, PermissionError) as e:
    raise RollingOpsFileSystemError('Failed to persist etcd config file.') from e
Contributor

Maybe we should catch the RetryError from tenacity here as well?

Author

Since these errors need manual intervention, I don't think tenacity would help much. I would just let them bubble up.
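The trade-off being discussed can be seen with a minimal pure-Python stand-in for `with_pebble_retry` (not the tenacity-based original): either catch the exhaustion error at the call site and re-raise a domain error, or let it bubble when the failure needs manual intervention anyway.

```python
import time


class RetryExhaustedError(Exception):
    """Raised once every attempt has failed; callers may let this bubble up."""


def with_retry(fn, attempts: int = 3, delay: float = 0.01):
    """Call fn, retrying transient OSErrors a fixed number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except OSError as e:
            last_error = e
            time.sleep(delay)
    raise RetryExhaustedError('all attempts failed') from last_error
```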


def is_etcdctl_installed() -> bool:
"""Return whether the snap-provided etcdctl command is available."""
return shutil.which(ETCDCTL_CMD) is not None
Contributor

Could it happen that this call is executed multiple times in the same event?
If yes, we probably want to cache that result.

Author

It could happen.
Do we agree that the cache will be cleared at every event?

Contributor

Yes, the cache lives in the Python process, which is restarted for each event.
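The per-event caching discussed above could be a one-liner with `functools.cache` (the function name is illustrative); since a fresh Python process handles each Juju event, the cache naturally resets between events:

```python
import functools
import shutil


@functools.cache
def is_command_available(cmd: str) -> bool:
    """Check PATH once per process; repeat calls within one event hit the cache."""
    return shutil.which(cmd) is not None
```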

Contributor

Do those changes mean that the template is generating something that doesn't make sense?
If so, it's worth opening an issue.

Author

Not necessarily.
Juju.wait by default waits for 3 minutes, which depending on your charm may not be enough.
As for juju.debug_log, I set the limit to 0 to be able to see the debug-log after an integration test failure.
The charmtech team is working on being able to export the logs after the workflow run.

Contributor

@Gu1nness Gu1nness left a comment

Amazing work, nearly there for me!
Mostly a few more questions.
The only blocking point for me, which we need to discuss/fix, is the security part to prevent DoS.

Contributor

@Gu1nness Gu1nness left a comment

I think it's all good to me; just flag the potential hacking issue as a TODO to improve security.

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Thanks Patricia!

@james-garner-canonical
Collaborator

Per our chat just now, the plan is for @patriciareinoso to create a DPE-9349-maintenance* or similarly named branch in her fork and merge this PR branch into it. Then the remaining feature work can be reviewed by Data folks as PRs against that branch. When the library is ready for a release, a PR from that feature branch in the fork can be made against canonical/charmlibs:main.

In the meantime, the library can be tested in charms by using a git dependency.

Charm Tech won't provide code review of library internals, just the public API, so there shouldn't be any duplicate review following this process. The UX in the spec looks good to me, and the overall approach seems very reasonable, so I don't anticipate major issues with the final review.


* the maintenance suffix is a workaround so CI triggers on PRs to the branch ...

@james-garner-canonical
Collaborator

Regarding the library version, I think a 0.X version makes sense for the initial release if you anticipate the API evolving in potentially breaking ways after some use in the mongo charm. When you're happy with the shape of the evolved API, it would be good to lock in a 1.0 release and advertise the library more widely.

@james-garner-canonical
Collaborator

PS: I've switched the integration tests to run on ubuntu-22.04 on main, so you can avoid changing the global workflow files in your PRs (which triggers CI for all libs).

@reneradoi reneradoi left a comment

No concerns from my side.
